Popular Comments

Recent Discussion

Buck Shlegeris
I don’t understand your last sentence, can you rephrase?
Rohin Shah
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Buck Shlegeris
I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model. You might want to use something debate-like in the synthetic input generation process, but that's structurally different.

I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:

  1. You train a general-purpose reasoner (via debate), and simply prompt it to do untrusted monitoring. This could do better than a trained untrusted monitor because it is less likely to overfit to spurious features of synthetically generated dangerous actions. (You instead use the synthetically generated dangerous actions to validate your untrusted monitor, or to hill climb on your prompt.)
  2. You do train on synthetically generated dangerous actio
... (read more)
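To make the point about labels concrete, here is a minimal sketch (my own construction, not anything described above) of assembling a training set for an untrusted monitor: because the dangerous actions are synthetically generated, their labels are known by construction. `sample_benign_action` and `generate_synthetic_attack` are hypothetical stand-ins.

```python
# Sketch: assembling a labelled training set for an untrusted monitor.
# Because every dangerous example is synthetically generated, its label is
# known before the monitor ever sees it. Both generators are stand-ins.
import random

def sample_benign_action() -> str:
    """Stand-in for a real, benign action (e.g. from deployment logs)."""
    return f"benign action {random.randint(0, 10**6)}"

def generate_synthetic_attack() -> str:
    """Stand-in for a synthetically constructed dangerous action."""
    return f"synthetic attack {random.randint(0, 10**6)}"

def build_monitor_training_set(n_benign: int, n_attacks: int):
    data = [(sample_benign_action(), 0) for _ in range(n_benign)]
    data += [(generate_synthetic_attack(), 1) for _ in range(n_attacks)]
    random.shuffle(data)
    return data  # (action, label) pairs with labels fixed in advance

train = build_monitor_training_set(n_benign=1000, n_attacks=1000)
# ...then the untrusted monitor is trained on `train` with ordinary
# supervised learning (or the set is used only for validation, as in
# option 1 above).
```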
Sam Marks
@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").
Buck Shlegeris
I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.
Sam Marks
The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)

This uses transformers, which IIUC is way less efficient for inference than e.g. vLLM, to an extent that is probably unacceptable for production use cases.
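For concreteness, here is a minimal sketch of that kind of forward-hook steering. This is my toy reconstruction, not Benchify's code: the model, the layer index, and `steering_coefficient` are arbitrary placeholders, and the "decoder direction" is random rather than taken from a trained SAE.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical unit-norm SAE decoder direction for a single feature.
hidden_size = model.config.hidden_size
direction = torch.randn(hidden_size)
direction = direction / direction.norm()
steering_coefficient = 5.0  # how far to shift activations along the direction

def steering_hook(module, inputs, output):
    # Transformer blocks in `transformers` typically return a tuple whose
    # first element is the residual-stream activations (batch, seq, hidden).
    hidden_states = output[0] if isinstance(output, tuple) else output
    hidden_states = hidden_states + steering_coefficient * direction.to(hidden_states)
    if isinstance(output, tuple):
        return (hidden_states,) + output[1:]
    return hidden_states

# Attach the hook to one block (the choice of layer 6 is arbitrary here).
handle = model.transformer.h[6].register_forward_hook(steering_hook)

inputs = tokenizer("def fizzbuzz(n):", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0]))

handle.remove()  # detach the hook when done
```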

Logical counterfactuals are when you say something like "Suppose $P$ were true (for some false proposition $P$); what would that imply?"

They play an important role in logical decision theory.

Suppose you take a false proposition $P$ and then take a logical counterfactual in which $P$ is true. I am imagining this counterfactual as a function $f$ that sends counterfactually true statements to 1 and false statements to 0.

Suppose $P$ is "not Fermat's last theorem". In the counterfactual where Fermat's last theorem is false, I would still expect 2+2=4. Perhaps not with measure 1, but close. So $f(2+2=4) \approx 1$.

On the other hand, I would expect trivial rephrasings of Fermat's last theorem to be false, or at least mostly false, so $f$ should send them to values close to 0.
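Summarizing the desiderata from this example in symbols (my shorthand, with $P$ and $f$ as above):

$$f(P) = 1, \qquad f(2+2=4) \approx 1, \qquad f(\text{trivial rephrasings of Fermat's last theorem}) \approx 0.$$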

But does this counterfactual produce a specific counterexample? Does it think that $a^n + b^n = c^n$ holds for some particular $a, b, c, n > 2$? Or does it do something where the counterfactual insists a counterexample exists, but...

There is a model of bounded rationality, logical induction. 

Can that be used to handle logical counterfactuals?

Donald Hobson
If two Logical Decision Theory agents with perfect knowledge of each other's source code play the prisoner's dilemma, theoretically they should cooperate. LDT uses logical counterfactuals in its decision making. If the agents are CDT, then logical counterfactuals are not involved.
Viliam
I guess it depends on how much the parts that make you "a little different" are involved in your decision making. If you can put it in numbers, for example: I believe that if I choose to cooperate, my twin will choose to cooperate with probability p; if I choose to defect, my twin will defect with probability q; also I care about the well-being of my twin with a coefficient e, and my twin cares about my well-being with a coefficient f. Then you could take the payout matrix and these numbers, and calculate the correct strategy.

Option one: what if you cooperate? Your payout is C-C with probability p and C-D with probability 1-p; your twin's payout is C-C with probability p and D-C with probability 1-p. You multiply your twin's expected payout by your empathy e and add it to your own expected payout. That is option one; now do the same for option two, and compare the numbers.

It gets way more complicated when you cannot make a straightforward estimate of the probabilities, because the algorithms are too complicated. It could even be impossible to find a fully general solution (because of the halting problem).
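A minimal sketch of that calculation in code (the payoff matrix and the numbers for p, q, e are made up for illustration, and payoffs are indexed as (my_move, twin_move)):

```python
# (my_payout, twin_payout) for each combination of moves in the prisoner's dilemma
payoff = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
p = 0.9  # P(twin cooperates | I cooperate)
q = 0.9  # P(twin defects   | I defect)
e = 0.5  # how much I care about my twin's well-being

def value_of_cooperating():
    mine = p * payoff[("C", "C")][0] + (1 - p) * payoff[("C", "D")][0]
    twin = p * payoff[("C", "C")][1] + (1 - p) * payoff[("C", "D")][1]
    return mine + e * twin

def value_of_defecting():
    mine = q * payoff[("D", "D")][0] + (1 - q) * payoff[("D", "C")][0]
    twin = q * payoff[("D", "D")][1] + (1 - q) * payoff[("D", "C")][1]
    return mine + e * twin

print("cooperate:", value_of_cooperating())  # 2.7 + 0.5 * 3.2 = 4.3
print("defect:   ", value_of_defecting())    # 1.4 + 0.5 * 0.9 = 1.85
```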
Donald Hobson
And here the main difficulty pops up again. There is no causal connection between your choice and their choice; any correlation is a logical one. So imagine I make a copy of you, but the copying machine isn't perfect: a random 0.001% of neurons are deleted. Also, you know you aren't a copy. How would you calculate those probabilities p and q, even in principle?

I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I've compiled a longlist of 19 different arguments I've heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I'm sharing them in the hopes that they are interesting to people.

(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision and having few categories, and expect I could cut this down substantially with effort)

Credit to Evan...

  • If I could assume things like "they are much better at reading my inner monologue than my non-verbal thoughts", then I could create code words for prohibited things.
  • I could think in words they don't know.
  • I could think in complicated concepts they haven't understood yet. Or references to events, or my memories, that they don't know.
  • I could leave a part of my plans implicit, and only figure out the details later.
  • I could harm them through some action for which they won't understand that it is harmful, so they might not be alarmed even if they catch me thinkin
... (read more)

TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which complete a missing piece of the Bitter Lesson and open up a new axis for scaling compute. Following Rush and Ritter (2024) and Brown (2024a, 2024b), I explore four hypotheses for how o1 works and discuss some implications for future scaling and recursive self-improvement.

The Bitter Lesson(s)

The Bitter Lesson is that "general methods that leverage computation are ultimately the most effective, and by a large margin." After a decade of scaling pretraining, it's easy to forget that this lesson is not just about learning; it's also about search.

OpenAI didn't forget. Their new "reasoning model" o1 has figured out how to scale search during inference time. This does not use explicit...

You might enjoy this new blogpost from HuggingFace, which goes into more detail.

Jesse Hoogland
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don't think the RL setup is actually that straightforward.  If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching. 
Jesse Hoogland
The examples they provide in one of the announcement blog posts (under the "Chain of Thought" section) suggest this is more than just marketing hype (even if these examples are cherry-picked); see in particular the excerpts from two of the eight examples, "Cipher" and "Science".
Jesse Hoogland
It's worth noting that there are also hybrid approaches, for example, where you use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model that you then train your reasoning model against. 
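As a toy illustration of the "score candidates with a verifier" idea (emphatically not a claim about how o1 actually works), here is a best-of-N sketch where extra inference compute buys a better expected best answer. `generate_candidate` and `verifier_score` are hypothetical stand-ins for a model sampler and an automated verifier or (process) reward model.

```python
import random

def generate_candidate(prompt: str) -> str:
    """Stand-in for sampling one chain of thought + answer from a model."""
    return f"candidate {random.randint(0, 10**6)} for: {prompt}"

def verifier_score(prompt: str, candidate: str) -> float:
    """Stand-in for scoring a candidate with a verifier or reward model."""
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    # Larger n = more test-time compute = higher expected best score.
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))

print(best_of_n("Prove that sqrt(2) is irrational.", n=16))
```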